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ABSTRACT 



Focusing on the experience of the Yale University 
(Connecticut) social science data preservation project, this document 
presents a case study of migration as a preservation strategy, exploring 
options for migrating data stored in a technically obsolete format and their 
associated documentation stored on paper. The first section provides 
background and a project description, an overview of the Yale Roper 
Collection of public opinion research data sets and paper records, and a 
summary of the literature search. The following nine steps of the data track 
are described in the second section: identify equipment; copy files from 
mainframe-based media to local hard disks; examine documentation; define the 
column binary format; develop standard variable-naming classifications; read 
in data with SAS (Statistical Analysis System) and SPSS (Statistical Package 
for the Social Sciences); identify migration formats; recode data files with 
SAS; and create spread ASCII data files without recoding. The next section 
addresses the documentation track, including software /equipment , TextBridge 
Pro optical character recognition software, PDF (portable document format) 
files from Adobe Capture, and HTML and SGML/XML marked-up files. Findings and 
recommendations are presented in the fourth section, including user 
evaluation, findings about data/documentation conversion, and recommendations 
to data producers. A glossary is included, and support documents are 
appended. Contains 18 references. (AEF) 
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The Digital Library Federation 

On May 1, 1995, 16 institutions created the Digital Li- 
brary Federation (additional partners have since 
joined the original 16). The DLF partners have commit- 
ted themselves to "bring together — from across the 
nation and beyond— digitized materials that will be 
made accessible to students, scholars, and citizens 
everywhere." If they are to succeed in reaching their 
goals, all DLF participants realize that they must act 
quickly to build the infrastructure and the institutional 
capacity to sustain digital libraries. In support of DLF 
participants' efforts to these ends, DLF launched this 
publication series in 1999 to highlight and disseminate 
critical work. 

Donald J. Waters 

Director 

Digital Library Federation 
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Preface 

Quantitative data, including social survey results, test measurements, 
economic and financial series, and government statistics, are vital re- 
sources for research and education in a variety of disciplines con- 
cerned with advancing the study of individuals and society. For de- 
cades, these data have been encoded, stored, and used primarily in 
digital form. Custodians who have collected, maintained, and provid- 
ed access to numeric data resources thus have been building and man- 
aging digital libraries — and scholars and students have been effective- 
ly using them in the pursuit of historical, social, and scientific studies 

long before the term digital library came into wide currency. 

Those who are grappling with an explosion of digital information 
in a dizzying range of formats have much to learn from social science 
data librarians and users who have relatively long experience in man- 
aging and working with digital resources. Data producers, librarians, 
and scholarly users have come to invest in very sophisticated mecha- 
nisms for storing and distributing social science data. They have 
achieved valuable economies of scale in data storage and delivery 
through consortial developments, such as the data archives held by 
the Inter-university Consortium for Political and Social Research 
(ICPSR). Through years of experience with repeated changes in stor- 
age technologies and in the software for encoding and using the data, 
they have become particularly adept at the long-term maintenance of 
information in digital form. 

In 1996, the Task Force on Archiving of Digital Information high- 
lighted the difficulties of preserving digital information over long pe- 
riods of time. As a way of addressing these difficulties, the task force 
recommended in part that its sponsors, the Commission on Preserva- 
tion and Access and the Research Libraries Group, seek to document 
the experiences of communities already well practiced in the preserva- 
tion of digital information. Responding to this recommendation, the 
Commission, which has since merged with the Council on Library Re- 
sources to become the Council on Library and Information Resources 
(CLIR), sought out the expertise of those managing university-based 
data archives. It contracted for the development of this paper with the 
authors, who at the time worked together in managing the Social Sci- 
ence Data Archives at Yale University, one of the oldest data archives 
in American universities. 

Preserving the Whole appears as the second publication of the Digi- 
tal Library Federation and reflects the Federation's interests both in 
advancing the state of the art of social science data archives and in 
building the infrastructure necessary for the long-term maintenance of 
digital information. The paper is especially valuable as a meticulously 
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detailed case study of migration as a preservation strategy. It explores 
the options available for migrating both data stored in a technically 
obsolete format and their associated documentation stored on paper, 
which may itself be rapidly deteriorating. The obsolete data format 
known as column binary was born in the same era of creatively parsi- 
monious coding techniques that have given rise to the widely publi- 
cized Year 2000 (Y2K) computer problems. 

Beyond its contributions to our understanding of migration as a 
particular strategy for the long-term maintenance of digital informa- 
tion, Preserving the Whole also provides more general lessons. It is a 
remarkable finding of this study that the column binary format, al- 
though technically obsolete, is so well documented that numerous op- 
tions exist not just for migrating column binary files to other formats, 
but also for reading them in their native format. Moreover, the authors 
make the important observation that data sets will be indecipherable 
and cannot survive at all, regardless of the file format in which they 
are stored, if there is no effort made also to preserve their codebooks. 

A codebook is essential documentation that relates the numeric data 
to meaningful fields and values of information. 

From more theoretical perspectives, Jeff Rothenberg (1999) and 
David Bearman (1999) both emphasize the critical importance of docu- 
mentation, or metadata, for preserving digital information. The value 
of Preserving the Whole is that it makes a similar argument, but con- 
cretely and from the long experience of the data community in effec- 
tively managing digital information. 



Donald J. Waters 
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Background 
and Project 
Description 



I n December 1994, the Commission on Preservation and Access 
and the Research Libraries Group created the Task Force on 
Archiving of Digital Information. The purpose of the task 
force was "to investigate the means of ensuring continued access 
indefinitely into the future of records stored in digital electronic 
form." Digital media are more fragile than paper and become un- 
readable more quickly because of changes in operating systems and 
applications software and the deterioration of physical media, and 
because no organization has accepted responsibility for preservation. 
In its definitive 1996 report. Preserving Digital Information, the task 
force warned that ^^owners or custodians who can no longer bear the 
expense and difficulty of migration will deliberately or inadvertently, 
through a simple failure to act, destroy the objects without regard for 
future use." 

The task force's warning echoed the growing realization by re- 
searchers who were using social science statistical data in digital 
form and specialists who were archiving these data that major rescue 
efforts to identify, locate, and preserve computer files produced with 
rapidly outmoded technology could not be postponed. Because ac- 
cess to social science numeric data requires metadata — accompany- 
ing paper or machine-readable records — the loss of the metadata can 
also mean the loss of the data file. 

The two approaches to preservation of digital files under evalua- 
tion in the early 1990s were refreshing and migration. Refreshing refers 
to the copying of information from one medium to another without 
changing the format or internal structure of the records in the files. 
Refreshing digital information will suffice as long as software exists 
to manipulate the format of the files. Since digital information is pro- 
duced in varying degrees of dependence upon particular hardware 
and software, refreshing cannot serve as a general solution for pre- 
serving digital information. The task force emphasized migration of 
digital information, "designed to achieve the periodic transfer of dig- 
ital materials from one hardware/software configuration to another, 
or from one generation of computer technology to a subsequent gGu- 
eration." Migration includes refreshing the media but also addresses 
the internal structure of the files so that the information within can 
be read on subsequent computer platforms, operating systems, and 
software. 



The Yale Social Science Data Preservation Project 

In 1996, the Commission on Preservation and Access commissioned 
the Social Science Library and the Social Science Statistical Laborato- 
ry at Yale (Statlab) to identify and evaluate the formats that would 
most likely provide the ability to migrate social science statistical 
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data and accompanying documentation^ into future technical envi- 
ronments. The Yale University Library, one of the first academic li- 
braries to form a collection of machine-readable data, began acquir- 
ing social science numeric data in 1972. Over the years, Yale has 
copied its data from one form of digital storage to another as main- 
frame computer technology has dictated. The copying of data, while 
labor-intensive, was straightforward in creating exact logical copies 
from out-of-date media in newer data storage formats. In the mid- 
1990s, as data use was moving from the mainframe to distributed 
Computing systems and from one hardware/ software configuration 
to another, digital formats began to require not just simple duplica- 
tion, but restructuring. Files produced by standard statistical soft- 
ware on mainframes had to be converted into platform-independent 
formats before moving to personal computers. In addition, data 
stored on magnetic tapes had to be moved to new media as access to 
and support in using the Yale mainframe was discontinued. 

Our social science data preservation project team was headed by 
Ann Green (director, Statlab) and JoAnn Dionne (data librarian. So- 
cial Science Library) with the assistance of Martin Dennis (consult- 
ant, Statlab, and graduate student, psychology). We began our work 
in June 1996, during a time when many academic institutions were in 
the process of transferring numeric social science data sets from 
mainframe environments to PC- and UhllX-based networks. Large 
collections of numeric data had been successfully moved across these 
platforms. Considerably less attention had been directed toward the 
greater problem of developing system-independent archival formats, 
while also preserving and digitizing the accompanying paper 
records (metadata) that must be available to analyze the data sets. 

We were thus faced with a two-track preservation approach: con- 
verting deteriorating paper (the documentation) to digital form, and 
migrating digitized numeric data to an archival format that can be 
read by future operating systems and applications software. The Yale 
University Library had taken a lead in digitizing for preservation 
(Conway 1996), and we built on that base in digitizing the paper 
records accompanying the data file. The Statlab had taken a lead in 
migrating data collections from mainframe-dependent tape storage 
to networked online storage, and we built on that base in restructur- 
ing and migrating the numeric files. 

On the documentation track, we scanned printed textual materi- 
al for 10 surveys selected from the Yale Roper Collection and evalu- 
ated the outcomes of applying optical character recognition (OCR), 
creating image files, and producing Adobe Portable Document For- 
mat (PDF) files. On the data track, we investigated diligently and in 
detail the implications of preserving data in their original format vs. 
migrating to restructured formats. We evaluated the alternative for- 
mats for migrating the original data files from tape and focused 
upon the benefits and drawbacks of each alternative. Details are cov- 
ered in the Findings and Recommendations section of this report. 



I At its first occurance, a word defined in the Glossary is shown in bold. 
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While evaluations of computer storage media should not be ig- 
nored in an overall strategy for planning the future costs and viabili- 
ty of data collections, we did not include media evaluations in this 
project. Nor did we research the intellectual property issues involved 
in conversion, leaving that to a later discussion. However, there is a 
long-established ethic in the social science data community that data 
documentation should be shared freely. For example, the Inter-uni- 
versity Consortium for Political and Social Research (ICPSR) recent- 
ly began making all its machine-readable documentation freely ac- 
cessible on the Internet. 

At the end of the project in the fall of 1997, we developed a col- 
lection of information, including sample programs and documents, 
relevant to the project and made it available at the Statlab Web page 
of the Yale Web site. The collection of information has since moved to 
the Council on Library and Information Resources' Web site at 
http: / / www.clir.org/ pubs/ reports/ pub83/ statlab. Included in the 
materials accessible at this site are: 

• a link to the Interim Report to the Commission on Preserva- 
tion and Access 

• programs to create spread ASCII data files 

• spread ASCII data file example 

• sample data map for spread ASCII data file 

• SAS programs for recoding data and producing ASCII data 
from SAS data files 

• link for downloading Adobe Acrobat Reader 

• multiple examples of Adobe PDF files 

The Roper Collection at Yale 

The Yale Roper Collection contains materials from the Roper Center 
for Public Opinion Research (the Roper Center), whose data sets 
comprise a rich resource for research in political psychology and so- 
ciology. They provide a record of public opinion research in the Unit- 
ed States from 1935 to the present, along with surveys conducted 
abroad since the 1940s. In addition to the data files, the Yale Roper 
Collection includes paper records such as questionnaires, informa- 
tion on sample sizes, and other notes necessary for use of the data 
files. Many of the paper records are brittle, have handwritten notes, 
and were produced through unstable copying technologies such as 
mimeography. 

The first step in the project was to select a representative group 
of documents and accompanying data files from the collection. Our 
initial discussions led us to select the Roper Reports, a significant, 
heavily used part of the Yale Roper Collection. The Roper Reports 
have been produced since 1973 by the Roper Organization, a com- 
mercial polling company now known as Roper Starch Worldwide, 
Inc. The Roper Reports have 1,500-2,000 respondents, 200-300 vari- 
ables, and polling for the reports is conducted 10 times per year in 
the United States. Data files contain demographic information such 
as age, sex, race, economic level, education, marital status, union 
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rnernbership, religious and political affiliation, and responses to 
questions on a broad array of issues facing society such as energy, 
politics, media, health and medical care, consumer behavior, educa- 
tion, and foreign policy. 

The Roper Reports in the Yale Roper Collection do not have ma- 
chine-readable documentation supplied with the data files. The doc- 
umentation consists of paper photocopies of questionnaires and 
computer output. Some parts of the documentation are poorly dupli- 
cated copies with blurred text on a gray background and some ques- 
tionnaires have handwritten notes in the margins. Most of the ques- 
tionnaires are printed in multiple columns on a page with no 
standard format or layout. The Roper Reports documentation collec- 
tion thus represents the problems inherent in the rest of the Yale Rop- 
er Collection. Of the 200 Roper Reports in the Yale Collection at the 
time of the project, 10 studies were selected across the full span of 
years to include any differences in format or documentation. 

Our selection of the Roper Report data files was particularly im- 
portant in the context of migrating data files. The files were stored in 
column binary format with portions of the files coded in an archaic 
format based upon the IBM punch card. The responses of a single 
case or individual interview were represented on one or more punch 
cards. Each punch card had 80 columns and 12 rows. The non-col- 
umn binary format allowed a maximum of one character per column 
and a maximum of 80 variables per card. The column binary format, 
however, made it possible to store more than one variable in the 
same column. With punches allowed in each of the 12 rows, the max- 
imum number of items was increased by up to a factor of 12. This 
column binary format was especially popular in the 1960s and 1970s 
when information was stored almost exclusively on computer cards, 
making it desirable to compress the data into as small a space as pos- 
sible, because it provided space for multiple answers to a single 
question. 

Special instructions must be given in software programs to de- 
fine this unique column and row structure. Since the format is based 
upon old technology, knowledge about its use and software input 
formats to read it are increasingly rare. Our challenge was to find a 
new format that preserved the full intellectual content of the binary 
coding while allowing current and future technology to read the data 
and convert the computer card punches into meaningful values. 



Literature Search 

We reviewed the library literature for this project and conducted 
searches of the Internet. Discussion of the issues involved in ar- 
chiving digital information had been well detailed in Preserving 
tal Information, so we limited our search to topics specific to the pres- 
ervation of social science numeric data and documentation. The 
literature search revealed much information on imaging as a preser- 
vation technique for books but little on preserving documentation 
for data files (see Reference List). We uncovered no previously pub- 
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lished material on methods of preservation of electronic materials, 
other than duplicate copies moved from one storage medium to an- 
other. We found little information on the subject of copying data files 
and changing the way they are coded. We searched for reports on the 
conversion of multiple-punched data to other formats but found 
nothing. Nor did we find any discussion of standards for such con- 
versions nor of the validity of various numeric data storage formats 
as archival media. 

In addition, we inquired of the Center for Electronic Records of 
the National Archives and Records Administration (NARA), Archi- 
val Research and Evaluation Staff, to identify any standards they fol- 
low internally. NARA retains numeric data in the format they are re- 
ceived but will transform them on request (Adams 1996). There had 
been discussion among members of the data archive community 
about whether column binary was an acceptable archival format, but 
we found no published discussion of this issue. We also searched for 
reports on the use of proprietary formats in archiving electronic 
records. Again, we found almost no mention of numeric data in the 
published literature. 

On the Web site of the ICPSR, we found one discussion of the 
conversion of questionnaire-type information from paper to electron- 
ic formats using OCR as opposed to imaging. This type of informa- 
tion may also be found in the business records management litera- 
ture, which we did not review. JSTOR, the Journal Storage Project, 
was making images of journal pages available to subscribers via the 
Web and using OCR to index the pages (JSTOR 1996). This approach 
seemed to overcome the limitations of using either imaging or OCR 
technology alone. 

We concluded that we needed to extrapolate from the more gen- 
eral literature on archiving textual data, which emphasized the desir- 
ability of storing information in formats independent of hardware 
and software (NARA 1990). The perils of using formats that depend 
on hardware and software in the case of textual data had been de- 
scribed by Jeff Rothenberg (1995). We had no reason to expect that 
numeric data would be any different. 

During 1995 and 1996, we followed discussions on the informal 
list for ICPSR Official Representatives and the listserv for members 
of the International Association for Social Science Information Ser- 
vice and Technology, especially on the use of PDF for storage and 
distribution of codebooks. The discussions focused on the concern 
that PDF was not an acceptable archival format and would require 
reformatting during the lifetime of the documents. ICPSR also pub- 
lished a discussion of this issue (1996). 
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The Data Track The preservation project's goals for the data track were to develop 

and evaluate a process of migrating digital numeric information from 
computer tape to hardware- and software-independent formats and 
to evaluate the utility of the resultant /ormafs. The process of migrat- 
ing was broken down into a series of nine steps. 

1 . Identify equipment 

2. Copy files from mainframe-based media to local hard disks 

3. Examine the documentation 

4. Define the column binary format 

5. Develop standard variable-naming classifications 

6. Read in the data files with SAS and SPSS 

7. Identify migration formats 

8. Recode data files with SAS 

9. Create spread ASCII data files without recoding 



7. Identify Equipment 

The computing environment used for the project consisted of IBM 
mainframes (all commands were submitted as JCL batch jobs issued 
from a CMS-based IBM mainframe to an MVS mainframe with tape 
access), and PC/Intel (Pentium 90) computers running Microsoft 
Windows for Workgroups on a Novell network. 



2. Copy Files 

We copied the column binary data files from old round reel tapes to 
new 3480 IBM cartridges on the Yale mainframe. This was the first 
step in refreshing the data from the old tape medium to a more sta- 
ble magnetic medium. Next, the data were copied from the cartridg- 
es to mainframe disk and moved via standard file transfer protocols 
(in binary mode) to the Statlab Novell server, the home for the SAS 
program writing, record keeping, and output storage. 

Once transferred to disks connected to the server, each data set 
was checked with a simple program that read in one variable— the 
deck or card number — for each observation in the original data file. 
The number of observations for each deck number was then com- 
pared with information in the documentation. Any discrepancies 
were noted and data files reordered from the Roper Center if errors 
were confirmed. 



3. Examine Documentation 

The primary document for a particular Roper Report was simply a 
copy of the original survey questionnaire containing the questions 
asked in the survey. The card number and column number (or num- 
bers) for locating the question in the record for each respondent were 
given for each question. Furthermore, the punch location was listed 
for each response option in the question. All of this information — 
card number, column location, and punch location — was necessary 
for reading the original column binary data. 
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The questionnaires varied in length and, in some cases, multiple 
versions of a questionnaire were provided. For example, some of the 
surveys used a split sample, either to ask different questions or to 
ask the same questions in different sequence. Often, the questions 
asked of the two groups in the split sample were not identical. In 
some cases, the column location and format of the different variables 
were identical. The samples could differ in the order of items in a 
multiple-response question, or in using slightly different sets of 
items. In other cases, the format of the questions (along with their 
column and row locations) changed radically between samples. 

The xray, a special form of printed output supplied by the Roper 
Center, provided a response frequency used to check the data during 
the migration processes. Organized by card, column, and row, the 
xray gave the total number of punched bits across observations for 
each variable in the data set. The xray also provided useful informa- 
tion about the types of questions being answered and the complexity 
of reading the column binary format. Because each question was 
usually encoded in its own column, the sum of all of the responses 
(plus any blank observations), for most types of questions, should 
add up to the total number of people in the survey. If it did not, then 
the question allowed multiple responses and would need to be read 
in a different manner from single response variables. 



4 . Define Format 

The structure of the column binary format is illustrated in table 1. 

The sample question shown in the table uses two columns (47 and 
48) to cover all the possible answers. The full text of this question 
may also be seen in Appendix 1. Each card contains 12 rows num- 
bered from top to bottom, beginning with 1 at the top. Punch num- 
bers are assigned in a different way: a 12 punch goes in the top row, 
an 11 punch in the second row, and the third through twelfth rows 
are reserved for 0 through 9 punches. The 11 and 12 punches are of- 
ten used for "don't know," "no answer," or "not applicable" respons- 
es. In this example, a respondent chose the values 1, 2, and 4 ("living 
in poverty," "being abused as a child," and "drug abuse"), so punch- 
es were made in rows 4, 5, and 7, as shown in the last column of the 
table. 

In this question, both columns 47 and 48 may have multiple 
punches representing the choices people were allowed to select from 
the list of causes of violent crime. Alternatively, in the non-column 
binary scheme allowing only one punch per column, each possible 
selection would have to be coded as a separate variable and therefore 
as a separate column, so coding of this question would take up 18 
columns rather than two. 



5 . Develop Standard Classifications 

A standard classification of types of variables, based upon the types 
of questions asked in the surveys, was used in the construction of 
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Table 1. Question coding example 

Roper Report 9309, Question 8W: "...which three or four things do you think are the main causes of 
people committing violent crimes?" 



Responses 


Column 

LOCATION 


Row 


Punch 

NUMBER 


Punched 


a. Living in poverty 


47 


4 


1 


X 


b. Parents not teaching right from wrong 




5 


2 


X 


c. Being abused as a child 




6 


3 




d. Drug abuse 




7 


4 


X 


e. What people see in TV programs 




8 


5 




f. A lack of morals 




9 


6 




g. A lack of education 




10 


7 




h. A person not seeing any harm in it 




11 


8 




i. A person being irresponsible 




12 


9 




j. Influence of friends 




3 


0 




k. Alcohol abuse 




2 


11 (or X) 




1. What people see in movies 




1 


12 (or Y) 




m. The advertising and marketing of toy guns, etc. 


48 


4 


1 




n. Guns being too easy to get 




5 


2 




o. Low chance of being punished 




6 


■ 3 




p. Seeing pornography 




7 


4 




None of these 




8 


5 




Don’t know 




1 


12 (or Y) 





data set translation. The classification provided a scheme for the cre- 
ation of standard templates and the logic behind the variable types 
used in recoding the data. In addition, the classification made it easi- 
er to construct appropriate variable names and to ensure consistent 
naming across data sets. Each of the variable types we defined re- 
quired different variable-naming, formatting, and recoding procedures. 

The four basic types of variables are regular numeric variables, 
numeric variables with special missing values, multiple-response 
questions, and single- response questions. Each variable type was as- 
sociated with a different template for code creation, a specific form of 
variable names, and certain projected difficulties in recoding. For in- 
stance, a multiple-response variable was read in with a simple list of 
single-punch variables, was labeled simply with letter suffixes, and 
was not recoded. In contrast, an aggregated single-response variable 
was read in with a list of single-punch intermediate variables, was 
labeled with intermediate number suffixes, and was recoded into a 
final variable for the entire question. Retrieving the variable type di- 
rectly from the documentation provided a guide to the particular 
piece of program code that was necessary for inputting and recoding 
the variable. 
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Regular numeric variables. Regular numeric variables range from 0 
to 9. The values are computed by filling in each digit with the num- 
bers in the appropriate columns. For example, Roper Report 9309, 
Question 68Y, asked "Do you think government lotteries produce an un- 
wholesome gambling spirit in this country? Yes=l, No=2" 

Numeric variables with special missing values. These variables are 
identical to regular numeric variables, except that there can be one or 
two special missing values (such as ''don't know"), recorded at 
punch 11 and/or punch 12. For example, Roper Report 9309, Ques- 
tion 2X, asked, "Do you feel things in this country are generally going in 
the right direction today, or do you feel that things have pretty seriously 
gotten off on the wrong track? Right track=l. Wrong track=2. Don't 
know=Y" 

Multiple-response questions. Multiple-response questions generally 
ask respondents to choose more than one option from a list of items, 
as in Roper Report 9309, Question 8W illustrated in table 1. If an item 
was checked, it was coded as 1 in the recoded data set; if it was not 
checked, it was coded as 0. Therefore, for any such question, there 
would be multiple binary variables corresponding to all of the possi- 
ble responses (including special missing values), so that each possi- 
ble response became a variable in a final recoded data set. 

Single-response questions. Single-response questions allow one, 
and only one, response per item. They frequently include a special 
missing value at punch number 12, with several answer options be- 
tween punch numbers 1 and 9. These questions require appropriate 
recoding for the final variable. For example, Roper 9309, Question 
7Y, asked, "And thinking about crime in the United States, what one type 
of crime do you feel presents the biggest threat for you and your family to- 
day?" An example of multiple-column storage of a single-response 
question appears in table 2. 



Table 2. Example of multiple-column storage of single-response question 



Response 


Punch 

NUMBER 


Column 

LOCATION 


a. Burglary, robbery, auto theft 


1 


45 


b. Vandalism/hooliganism 


2 




c. Official corruption, government bribe-taking 


3 




d. Murder 


4 




e. Rape 


5 




f. Assault 


6 




g. Racketeering, extortion 


7 




h. Drugs 


8 




i. Prostitution 


9 




j. Speculation and swindling 


0 




k. Other 


1 


46 


None of the above 


2 




Don't know 


Y 






6. Read in Data 

The first step in the SAS programming process was reading in the 
data using one of the three SAS column binary informats: 
PUNCH.d, CBw.d, and ROWw.d. The informat PUNCH.d reads a 
specific bit (with a value of either 0 or 1) in the original data set. The 
d value indicates which row in a column of data is to be read. The 
reformat CBw.d, on the other hand, looks to see which one of the 10 
numeric bits (0 through 9) has been punched, and returns that num- 
ber as a value for the column. ROWw.d begins reading at a user- 
specified bit, looks to see which one of a user-specified number of 
bits (after the first) has been punched, and returns a number equal to 
the relative location of the punched bit. For instance, a ROWS 5 infor- 
mat would start reading at PUNCH.5 and continue for four more 
bits through PUNCH.9; if bit 8 was punched, then the ROW5.5 infor- 
mat would return a 4. The PUNCH.d informat was the most appro- 
priate for this project. For clarification of its use, refer to the sample 
SAS program in Appendix 2. Depending on the final form of the 
question, the pattern of punches in a column usually had to be logi- 
cally recoded later in the program from an intermediate variable to a 
final variable that matched the response options in the original docu- 
mentation. 

The Statistical Package for the Social Sciences (SPSS) is also able 
to read data, including multiple-response question responses, from a 
column binary data file by using the keywords MODE=MULTI- 
PUNCH on the FILE HANDLE command. An example of an SPSS 
program that reads a single variable from Roper Report 9309 data 
with SPSS is: 



file handle test / name='h:/roper/rprr9309.bin' /recform=fixed / 
mode=multipunch/Ired=160. 

data list file=test records=9 / I feddep 24 (A) /9. 
execute. 

In this example, a variable called FEDDEP is read in column 24. It 
has a possible Y coded as "don't know", requiring that SPSS read 
this variable as a character string (see also SPSS 1988, 84-86). 



7. Identify Migration Formats 

We selected the following formats to test the migration process: 

SAS system files of recoded column binary data, with and 
without intermediate variables 

• SAS system files with shortened integer byte lengths 

• SAS export files of recoded column binary data 
ASCII files produced from recoded column binary data 
ASCII files of the binary data patterns in the original file, 
called spread ASCII 

These formats were selected on the basis of the software's ability to 
read the column binary format, the availability of project staff pro- 
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gramming expertise, the transportability of output formats, storage 
requirements (size of output data sets), and long-term archival impli- 
cations. While both SAS and SPSS software are able to read column 
binary data, the staff members working on the project had more ex- 
perience with SAS, so we chose to work with that statistical package. 
With the exception of spread ASCII data files, each format we select- 
ed for testing contained completely recoded and renamed variables. 
Storage requirements were compared on the different platforms for 
many of the different types of SAS data files, as shown in Appendix 3. 



8. Recode Data Files 

Create intermediate variables and code missing values. After reading in 
the data with the SAS informat statements, the second step in the 
SAS programming process was to produce a set of if-then statements 
for recoding the individual punch data (the intermediate variables) 
into final variables. Each intermediate variable needed to have a col- 
umn location, row location, variable name, and SAS informal in- 
structions specified in the input statement, as seen in the example 
shown in table 3. The only things that changed while inputting this 
variable Q14 were the suffix and punch location for each intermedi- 
ate variable. This redundancy increased when there were more inter- 
mediate variables, and especially with aggregated single-response 
variables. The logical statements used for recoding the intermediate 
variables also contained many repetitions of the same if-then com- 
mands. 

Since the creation of these translation programs in SAS involved 
a large amount of repetitive typing, we created templates for both 
the input and recoding statements. Templates were created for sin- 



Table 3. Example of variable location, name, and SAS informat statement 



Column location 


Variable name 


Row LOCATION AND 
INFORMAT 


@30 


Q14A_1 


PUNCH. 1 


@30 


Q14A^2 


PUNCH.2 


@30 


Q14A_3 


PUNCH.3 


@30 


Q14A_Y 


PUNCH. 12 


@31 


Q14B_1 


PUNCH.l 


@31 


Q14B_2 


PUNCH.2 


@31 


Q14B_3 


PUNCH.3 


@31 


Q14B_Y 


PUNCH.12 
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gle-response variables (simple and aggregated), multiple-response 
variables, and numeric variables with special missing codes. Each 
template included generic characters for column locations, variable 
names, and SAS informats, as well as pre-formed logical recoding 
statements. Furthermore, the templates contained repeated lines of 
code for different numbers of punches or aggregate variables so that 
it would not be necessary to enter the redundant information for the 
different variables. 

These templates considerably sped up translation code creation. 
Once the type of the current variable had been determined, the ap- 
propriate input and recode statements were copied from the tem- 
plate file, pasted into the program code, and then the key characters 
(such as column location and variable name) were replaced using a 
find /replace function in the text editor. For example, to create the 
code for variable Q14, a piece of the template would have been cop- 
ied from the template file and pasted into the program, as illustrated 
in table 4. The aggregated single-response variable template includ- 



Table 4. Examples of SAS input statements and recoding statements 



/* Three options + NR */ 

(for INPUT command) 

@COLl Q000A_1 PUNCH.l 
@COLl Q000A_2 PUNCH.2 
@COLl Q000A_3 PUNCH.3 
@COLl Q000A_Y PUNCH.12 
@COL2 Q000B_1 PUNCH.l 
@COL2 Q000B_2 PUNCH.2 
@COL2 Q000B_3 PUNCH.3 
@COL2 Q000B_Y PUNCH.12 

(for recoding statements) 

IF Q000A_1 =1 AND Q000A_2=0 AND Q000A_3=0 AND Q000A_Y=0 
THENQ000A=1; 

IF Q000A_1 =0 AND Q000A_2=1 AND Q000A_3=0 AND Q000A_Y=0 
THEN Q000A=2; 

IF Q000A_1=0 AND Q000A_2=0 AND Q000A_3=1 AND Q000A_Y=0 
THEN Q000A=3; 

IF Q000A_1 =0 AND Q000A_2=0 AND Q000A_3=0 AND Q000A_Y=1 
THEN Q000A=9; 

IF Q000B_1 =1 AND Q000B_2=0 AND Q000B_3=0 AND Q000B_Y=0 
THENQ000B=1; 

IF Q000B_1=0 AND Q000B_2=1 AND Q000B_3=0 AND Q000B_Y=0 
THEN Q000B=2; 

IF Q000B_1 =0 AND Q000B_2=0 AND Q000B_3=1 AND Q000B_Y=0 
THEN Q000B=3; 

IF Q000B_1 =0 AND Q000B_2=0 AND Q000B_3=0 AND Q000B_Y=1 
THEN Q000B=9; 
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ed input and recode statements for sub-questions from A through Z; 
only A and B are shown in table 4. 

Then the string QOOO would be replaced with Q14, and COLl 
and COL2 would be replaced with 30 and 31, respectively, resulting 
in the final code for inputting and recoding the variables Q14A and 
Q14B. 

Create macro programs to recode data files. Given that we could de- 
fine fairly well the different types of variables in a data set, and that 
we could create input and recode template statements for these dif- 
ferent variable types, it was tempting to think that we would be able 
to write programs to automate the whole procedure. That is, we 
might imagine a program that could perform all of the operations 
described above with minimal effort from the human operators. Al- 
though such automation may be possible in the future, the irregulari- 
ty of the data files presented a major obstacle. Not only would an au- 
tomatic translation program have to deal with some of the 
complexities of, for example, split samples with different variable 
types in the same column, but it would also have to handle the dif- 
ferent types of errors that occur in the original data files. 

For example, one fairly straightforward solution (though quite a 
lot of work) would have been to create a program that received some 
sort of variable list from a human operator, set up logical conditions 
to create code around split samples, and then write out a SAS pro- 
gram to translate and recode the data file. More sophisticated error- 
checking would have been necessary, however, to handle simple 
events such as a variable being incorrectly marked as a single-re- 
sponse type in the documentation, but actually having multiple re- 
sponses coded in the data. Because the recoded data files needed to 
be of archival quality, this error-checking would have had to be quite 
rigorous. Furthermore, judgment calls would have sometimes aris- 
en when dealing with irregularities (that is, should the variable be 
recoded differently or should the irregularity be ignored), so that 
leaving the decision solely to a computer program was not advis- 
able. In short, creating an automatic translation program to recode 
the data would have involved several compromises that we would 
not recommend. 

Debug programs and check data. Several types of errors occurred in 
the SAS programming: typographic errors, invalid informats, and 
unexpected changes in variable types (where the questionnaire did 
not match the data). The key indicators for these problems were IN- 
VALID DATA warnings that appeared in the SAS log. 

Once the programs ran without INVALID DATA messages, the 
accuracy of the translation needed to be checked. Complete frequen- 
cies for all of the variables in the final data set had to be compared to 
the frequencies in the xray. 
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For example, if variable Q9 (deck 1, column 30) had the follow- 
ing frequencies (in this case, PUNCH. 12 was recoded to the missing 
value of 9): 



Value 


Frequency 


1 


290 


2 


1300 


9 


403 



then the xray for deck 1, column 30 should look like this : 



Punch 

LOCATION 


12 


n 


0 


1 


2 


3 


4 


5 


6 


7 


8 


9 


Sum 


403 


0 


0 


290 


1300 


0 


0 


0 


0 


0 


0 


0 



It was necessary to check all of the variables in the final data set, 
because one badly coded variable in the conversion job could com- 
promise the archival integrity of the entire data set. Unfortunately, a 
random subset of variables may miss the errors. If there was a mis- 
match between the xray and the frequencies, the difference needed to 
be tracked down and the SAS job edited. This frequency check was 
also the last chance to resolve ambiguities between the type of vari- 
able listed in the documentation and that shown in the xray. One fi- 
nal point to note during this check was the maximum number of 
possible digits in a recoded variable. If the data set was to be output 
in ASCII forma t, then the number of digits in the special missing val- 
ue should be equal to the number of digits in the regular values. That 
is, if a variable had values 1 through 3, then the special missing value 
should be 9; if it had values 1 through 12, then the special missing 
should be 99; and so on. 

Sava rccodad data. The final step in the reading and recoding pro- 
cess was to write to disk the resultant SAS data sets. For recoded SAS 
to SAS export files, we saved the files as SAS system files and export 
files with all intermediate and recoded variables, and as SAS system 
files and export files with just the recoded variables (see Appendix 3). 

For recoded SAS to ASCII, once the SAS data set had been com- 
pletely debugged and double-checked, PUT statements could be 
written to create an ASCII version of the recoded data (see Appendix 
2). The simplest way to write these PUT statements was to copy the 
INPUT statements from the original job (or from the recoding state- 
ments), strip off the column binary informat information, and manu- 
ally type in the new column locations for each variable. These PUT 
statements had two advantages: unlike automatic translation pro- 
grams, they created ASCII data files with minimum wasted space 
and the PUT statement itself could be distributed as a data dictionary. 

For the first data set translated, an ASCII version of the recoded 
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SAS data set was created using an automatic translation program 
called DBMSCOPX version 5.10. However, this ASCII data set was 
not satisfactory. Although the translation program automatically cre- 
ated a data dictionary (based on the variable names in the SAS data 
set), the ASCII data set was not compact. The translation program 
seemed to be incapable of concatenating one variable's width of col- 
umns against the previous variable's columns, so that there would 
be no wasted space within the flat data set. Instead, the program in- 
serted multiple spaces between each variable, which allowed a dif- 
fering number of characters within each variable, but also inserted 
approximately 10 ASCII space characters between each variable. Be- 
cause of the enormously increased storage requirements this inser- 
tion causes, this approach to creating ASCII data files was aban- 
doned and program code for creating compact ASCII data files was 
written in SAS. 

9 . Create Spread ASCII Data Files 

The original column binary data files did not necessarily have to be 
recoded during the translation process. Another possible method of 
translation was simply to convert, or spread, the column binary into 
ASCII data. Spread ASCII data files keep the binary structure of the 
column binary data, but encode each 0 or 1 as an ASCII character. 

For example, the following column in the original data (each 0 or 1 is 
one bit) 

0 

0 

0 

0 

1 

0 

0 

0 

0 

0 

0 

0 

would become 000010000000 in the spread version. While this exam- 
ple uses a single-response variable, multiple-response questions may 
have more than one bit with a value of 1 in the column. To create a 
spread version of the data, each bit (or punch) in the column binary 
file could be read (via SAS in our case) as an individual variable; 
each variable was then written to a new ASCII data set. Because the 
punches in the column binary data were rectangular, and because the 
variables being written out did not have to be meaningfully recoded, 
the SAS code itself was largely a simple iteration, over columns and 
rows, of IMPUT and PUT statements. 

The iterative nature of the SAS program suggested that a macro 
program could be written to automate the creation of SAS code. A 
simple C program was written based on this iteration over columns 
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and rows. After prompting the user for unique information — the 
name of the SAS program file to be created, the name of the data set 
to be created, the name and path of the column binary file, and the 
number of decks and record length (LRECL) of the original data — 
the C program simply looped over columns and rows. The program 
passed through the loop twice. In the first pass, the program created 
an INPUT statement that would read in each punch (looping over 
columns and rows) as a new variable. The second pass created a PUT 
statement in which each new variable would be written to an ASCII 
file. The C program was not itself reading or writing the data files; 
instead, it created many repetitive lines of SAS code to read and 
write the data files (see Appendix 4). 

Each card of column binary data was translated to a line of 
ASCII data. Since each card contained 960 data points (80 columns 
by 12 rows), each line of ASCII data contained 960 characters. The 
spread ASCII data set expanded about 600 percent from the original 
size. It took about 15 minutes to create the spread ASCII data on the 
IBM PC network, including all steps from running the C program to 
writing out the ASCII data. 

We investigated the formats currently distributed by the Roper 
Center and the Institute for Research in the Social Sciences at the 
University of North Carolina at Chapel Hill (IRSS) to determine their 
utility over time, and to evaluate them in light of our findings. Both 
distributors were producing what we call the hybrid spread ASCII 
data files from the original column binary files. These files are pro- 
duced by honzontally concatenuting the data into a new format. The 
first 80 columns of the data set contained the 80 columns of the origi- 
nal file; however, any binary encoding of variables with multiple- 
punch coding was converted to ASCII. The entire file was then con- 
verted to spread ASCII and the resultant data were appended to the 
end of the record. Users could access the non-binary data in the orig- 
inal column locations; if they needed to access data that were origi- 
nally punched in the binary mode, they could read those data from 
the spread ASCII portion of the record. Data dictionaries were pro- 
duced that mapped the location of variables from both the original 
80 columns and the spread ASCII portion of the record. 

We obtained sample programs from IRSS to help us evaluate these 
hybrid spread file formats. We also acquired sample data files from 
IRSS and the Roper Center, with their enhanced data dictionaries, to 
evaluate their utility. We then adapted a SAS program from IRSS to 
produce the data map showing the new column locations of the data 
items. This data map allowed for quick translation from the column- 
by-row information necessary to read column binary data, to the col- 
umn-only information necessary to read the spread ASCII data. A note 
at the top of the data map showed the order of punches in the spread 
ASCII data. For example, for a response located at column 56, row 1 in 
the original data set, the data map showed the same response's loca- 
tion as column 661 in the ASCII data set; a response at column 56, row 
5 would be located at column 665; and a response at column 56, row 
Y(12) would be located at column 672 (see Appendix 5). 
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The 

Documentation 

Track 



Several options were available for digitizing the documentation, in- 
cluding image scanning, OCR, combinations of image and text (PDF 
format), and text encoding. The initial part of the project included 
identifying formats to test and set up the scanning workstation. 
Complete documentation for each file was scanned, including the 
questionnaires, xrays, frequency counts, data maps, and other meta- 
data whenever available. 



Software and Equipment 

The scanner used was a Hewlett-Packard ScanJet 4c with a document 
feeder and SCSI connection to an IBM 750-P90 PC. All files were 
stored on an external SCSI Iomega Jaz drive. For a major scanning 
project, a fast machine with 32 MB of memory was required. Also, a 
PCI bus SCSI card speeded up transfer rates from the scanner to the 
computer. An automatic document feeder reduced the labor by auto- 
mating the page turning. Such labor-saving devices are cost-effective 
because scanning operations tend to absorb a lot of resources and to 
constrain work on other major tasks while the scanning is in 
progress. 

Scanners differ in speed, and, for a given scanner, speed varies 
with the desired scanning resolution. Speed could be an important 
factor in making a purchasing decision for a major project, as it can 
have a considerable impact on labor costs. For the HP 4c, the time it 
takes to scan a page varied with the desired resolution as indicated 
below. 



Desired Resolution 


Scanner Speed 


600 dpi 


30 seconds 


300 dpi 


7.5 seconds 


200 dpi 


3.3 seconds 



TextBridge Pro Optical Character Recognition 

We first scanned the documentation for one Roper Report using the 
OCR software TextBridge Pro. We reviewed alternative OCR soft- 
ware products and, finding no significant benefits to using one over 
another, chose a package with which the staff had experience. Initial 
evaluation of the OCR output showed that there were significant 
numbers of errors in the resultant ASCII text. We tested various reso- 
lutions and documented time taken for setup and for scanning, opti- 
mum settings, range of file sizes, quality of proofing summaries, and 
procedures to follow. 

The questionnaires we scanned with the TextBridge Pro software 
had an unacceptable rate of character recognition, including incor- 
rect location information necessary for manipulating the accompany- 
ing data files. Handwritten notes were completely lost and the edit- 
ing costs of reviewing the output and changing all errors would have 
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been prohibitive. This format did not present us with an adequate 
archival solution to preserving the textual material, so no further 
documentation was scanned using this process. TextBridge Pro does 
not work well with poor originals, determining optimum scanning 
settings was very time-consuming (and sometimes impossible), com- 
pression formats did not give good results, and raw ASCII format 
required time-consuming reformatting (see Appendix 6 for a photo- 
copy of a page from Roper Report 9309, Question 10, and for the 
TextBridge Pro sample output of the same question). 

Document condition. When used with printed clean originals, 

OCR is very accurate even when the font size is small, and can repli- 
cate the formatting of the original document. For example, Text- 
Bridge Pro is capable of producing low-resolution images and ex- 
porting both images and text to a word processing document that 
retains the format of the original printed page. However, there are 
far more problems with this process when the quality of the original 
is anything less than perfect, as in our case. TextBridge has a particu- 
lar problem with italics and underlining, even with good quality 
originals. 

TextBridge Pro also does not work well with columns and has 
some difficulty recognizing tables and columns in originals, let alone 
poor photocopies. This problem gets much worse when some of the 
entries in some of the table cells are blank because the columns get 
shifted; cleaning up the resulting output files becomes a major un- 
dertaking. 

Scanner settings. TextBridge Pro allows considerable control over 
the way in which the document is scanned in as an image. For some 
settings, this task can be delegated to TextBridge Pro by setting op- 
tions to automatic ; in other words, TextBridge Pro tries to figure 
out what works as it scans the page. But TextBridge Pro does not al- 
ways make these determinations successfully. Nor are photocopies 
ideal for scanning, particularly if not all of the characters are com- 
pletely clear and if not all of the pages are of the same lightness/ 
darkness. Successful recognition requires changing the settings peri- 
odically to account for the varying quality of the photocopies. Two 
outcomes are possible if the settings are not optimal: in some cases, 
the program is unsuccessful in recognizing text that is legible in the 
original; in others, it gives the frustratingly cryptic error, "page too 
complex for selected mode." 

We finally became convinced that there was no simple system 
for setting these options. In the worst case, it was a frustrating pro- 
cess of trial and error. Too dark a setting meant that the program 
tried to decipher each small dot on the page as though it were part of 
a character. As a result, it was not possible to use 600x600 resolution 
in cases where originals were speckled. Likewise, selecting the "text 
only option for the original page format forced the program to try 
to convert everything on the page into text, including imperfections 
in the image or dark binding patches. On the other hand, sometimes 
the auto brightness setting scans were so light that no text was recog- 
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nized on the page. In some cases, we spent hours trying to correct 
the settings manually. 

Proofing text. TextBridge Pro provides an optional feature that 
facilitates the correction of recognition errors. When text is scanned, 
it is possible to save "proofing text." This information is used by spe- 
cial modules that are installed into WordPerfect or Word and are im- 
plemented as macros. If the relevant module is installed and the doc- 
ument opened in the word processor, all words are highlighted that 
TextBridge Pro was unsure it recognized. The color of the highlight- 
ed word indicates the confidence TextBridge Pro has in its accuracy. 

TextBridge Pro can be taught to recognize particular fonts with a 
higher degree of accuracy through a period of training during which 
it asks the user to help it recognize ambiguous characters. We did not 
use this feature extensively. Our limited experience showed it does 
increase the likelihood of successful recognition if originals were 
poor but not truly awful. The effort is only worthwhile if the same 
types of fonts will be encountered often, which was not the case with 
the Roper Reports documentation. 

Final format. The available formats into which the output text can 
be saved depends on the options selected. There are a very large 
number of possible word processor, text, and spreadsheet formats. 
However, if "save proofing text" is selected, then the file can be 
saved in only Word or WordPerfect format. Similarly, if "reconstitute 
text" is selected, only file formats that support fairly complex format- 
ting are available. When formatting text with the "reconstitute text" 
option, TextBridge Pro wilt use some of the new "style" features of 
WordPerfect or Word in the new document. This can make subse- 
quent editing cumbersome. Though the styles themselves can be ed- 
ited, an alternative is to save in a file format that supports the text 
features that are being "reconstituted" but does not support "styles." 
In this way, the formatting will appear as regular tabs, font changes, 
and so forth, that can be directly edited. The editing of ASCII text to 
recreate the format of the original is a major undertaking and the 
time required to reformat each document is extensive. We found that 
getting pagination to match the original is particularly difficult. 

The scanned images produced by TextBridge Pro can be stored 
in CCITT-3, a compression standard, for later processing, but the re- 
sults from the subsequent processing of these images were not as 
good as those obtained from processing the images directly from the 
scanner. We decided that using these compression formats would not 
give usable results. 



PDF Files from Adobe Capture 

The next step in the documentation portion of the project was to pro- 
duce documents in the portable document format (PDF) used by 
Adobe Acrobat, a widely accepted de facto standard for encoding 
electronic documents. The viewing software provided by Adobe al- 
lows for reading and searching the text, viewing the structure of a 
document, and printing high-quality hard copy. PDF documents pro- 



er|c 



28 



20 



Ann Green, JoAnn Dionne, and Martin Dennis 



vided clear, accurate reproductions of the questionnaires. The Adobe 
Capture software produced an interim ASCII text file that could be 
edited to improve text searching. An example of a viewing screen 
may be found at the end of Appendix 6. 

Basic structure of Acrobat files. Adobe Acrobat files (distinguished 
by the PDF suffix on the file names) can contain both text and image 
information from the original document. There are different types of 
PDF files containing different kinds of information: normal PDF, im- 
age-only PDF, and image+text PDF. 

Normal PDF files, by default, display the text information de- 
rived from the OCR process. Where the text information is unknown 
(when there is a nontext picture on the page or there were difficulties 
in the OCR process), a normal PDF file will insert the original image 
into that space on the display page. Image-only PDF files are, in ef- 
fect, paginated pictures of the original pages. Like tagged image file 
format (TIFF) files, the text in these images is not searchable. Like 
image-only files, image+text PDF files show the image of the original 
pages, but also contain searchable text. 

The image+text files were chosen as the most appropriate for this 
project. The user would see a faithful reproduction of the original 
documentation (complete with handwritten notes) with the PDF 
browser, but could also search for specific text within the document. 
If text in the search function looked suspicious, a user could view the 
original image. In comparison, files produced by OCR programs con- 
tain only the text information, with no way to double-check the text 
against the image of the original. 

Adobe Acrobat Capture procedures. When scanning the document 
with the Capture software, a set of pages was scanned in sequence. 
Each page was stored as a separate TIFF file, with the filenames 
numbered sequentially. For example, a 40-page document produced 
40 TIFF files, named pageOl, page02, through page40. These original 
image files in TIFF format were stored separately from the final re- 
formatted PDF documents, providing a set of image files for digital 
storage. 

When translating the images into editable text, the TIFF files 
were concatenated into a single document during the OCR scanning 
process. Acrobat Capture analyzed the page layout and grouped text 
into regions. It then identified characters and grouped them into 
words. The words were looked up in the Acrobat dictionary (which 
can be customized) and spelling suspects noted. Fonts were analyzed 
and font suspects identified. The interim text layer of the final PDF 
file contains no image data and can be edited with the Capture Re- 
viewer so that the text matches the original document as closely as 
possible. During the OCR process, each word was assigned a confi- 
dence rating, representing the software's estimate of its OCR accuracy. 

For the searchable text of the final PDF file to be as accurate as 
possible, many OCR errors were corrected by editing the interim 
files. Fortunately, the program used to edit interim files. Acrobat 
Capture Reviewer, would highlight words whose confidence levels 
fell below a certain threshold, or that were not included in a dictio- 
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nary file. The majority of unrecognized words could be easily spot- 
ted and corrected to match the original document. Although the Re- 
viewer software allows one to change fonts and other formatting op- 
tions, the only editing necessary for the project was in the content of 
the words used for searching purposes (words used to locate terms 
in question text and variable coding). Since the user would see only 
the image reproduction of the original, the underlying ASCII text 
need to be reformatted as a visual reproduction of the original. Once 
the document was edited, it was then saved in the image+text PDF 
format. 

Time and storage requirements of PDF files. The total time required 
to process a 39-page document was approximately four hours, from 
scanning to saving the final PDF. The scanning itself took 30 minutes; 
the OCR process took 20 minutes; and editing the resulting file took 
3 hours 15 minutes. The storage space required for this document is 
shown in table 5. 



Table 5. Example of storage space required 
for Acrobat PDF files 



File Type 


Storage Space 


39 TIFF files 


1. 19 MB 


collated image file 


1.42 MB 


final PDF files: 




normal PDF (image) 


924 KB 


image+text PDF 


1.49 MB 


ASCII output file 


100 KB 



Documentation for the other nine Roper Report data files was 
also scanned and edited. Some data files with split samples had dual 
documentation. The time taken for the scanning process using a doc- 
ument feeder ranged from 5 to 30 minutes. The OCR software took 
between 15 and 35 minutes to process each document. The time tak- 
en to edit each document varied widely, from one to eight-and-a-half 
hours. The time it took to complete a single document depended 
largely on the quality of the original. Features such as background 
shadows, crooked lines of text, compressed fonts, jagged edges on 
letters, and handwriting increased both the time it took the OCR 
software to process the page, and, more importantly, the time it took 
to edit or insert accurate, searchable text. In some cases, blocks of text 
were so unrecognizable that whole questions needed to be typed in as 
hidden search text. Additional time was required for error-checking. 

Problems encountered in text recognition and editing of PDF files. Al- 
though Adobe Acrobat Preview highlights most words with low con- 
fidence levels, some forms of errors are not so easily detected during 
the editing phase. 
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• A word was not recognized as a block of text by Capture 
software during the OCR process. This means that instead of 
a text form of the word, the document included simply a bit- 
mapped image of the word from the original page. Such an 
image, of course, would not be searchable. Fortunately, these 
image blocks had some telltale signs. First, they appeared to 
be in a font that was usually quite different from the fonts 
assigned to the recognized words. Also, they were usually 
surrounded by a fine gray or blue box. After some experi- 
ence, an editor could quickly spot the image boxes as the 
page was being read. Although these image boxes could not 
be changed to text, a Reviewer command can be used to in- 
sert hidden search text underneath them. A later search 
would highlight the image as if it were a normal word. 

• A word was recognized as another, valid English word. Tn 
this case, the confidence level for the word might be quite 
high and the word would not normally be highlighted. For 
instance, if a falsely recognized word was assigned a confi- 
dence level of 97 percent, and Preview was set to highlight 
words at 95 percent or below, then the wrong word would not 
be highlighted. These words can be highlighted by raising 
the threshold setting, although at high levels (98 percent or 
99 percent), virtually every word in the document would be 
highlighted. The only practical way to discover these errors 
was for an editor to read through the document carefully. 
Once errors were spotted, the correct words would be insert- 
ed. 

• Tn rare cases, a line of text could be skipped in the OCR pro- 
cess. Again, nothing would be highlighted, but fortunately 
there would be obvious gaps in the text block where the line 
was skipped. A careful reading of the document would re- 
veal these gaps. As a corrective, a new block of text could be 
created to overlay the gap. The correct text could then be 
typed in for the purpose of future searches. 

Tn each of these cases, an attentive editor must catch the OCR 
error and make appropriate changes to the text to verify accurate 
search and retrieval. We did not edit enough documents to estimate 
the average time needed for cleaning a complete document. Future 
projects will need to budget extensive editing costs. 



HTML and SGML/XML Marked-up Files 

Conversion of scanned text to hypertext markup language (HTML) 
format would provide a more readily accessible browsing format. 
However, the text of each document would need to be fully edited 
and formatted. As indicated above, the ASCTT output from the OCR 
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technology we used could not provide us with text clean enough to 
use in HTML. Moving text into documents adhering to the standard 
general markup language/extensible markup language (SGML/ 
XML) is the most labor-intensive but also the most dynamic alterna- 
tive for text applications. SGML/XML tagging allows customized 
and robust access to specific pieces of the documentation (such as 
question text and variable location information). SGML/XML Docu- 
ment Type Definitions maintain the integrity of document content 
and structure and also define the relationships among the elements. 
The emerging social science documentation standard for both for- 
matting and content, the Data Documentation Initiative (DDI) Doc- 
ument Type Definition (DTD), provides standard elements devel- 
oped specifically for social science numeric resources. The standard 
adheres to our requirement that text be stored in a system-indepen- 
dent, nonproprietary format. Furthermore, this standard, developed 
by representatives from the international social science research com- 
munity, is intended to fill the need for a structured codebook stan- 
dard that will serve as an interchange format and permit the devel- 
opment of new Web applications. However, this format requires that 
the text be fully edited and the components of the documentation 
tagged, and funding for this work was not included in the budget for 
this project. 
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Findings and 
Recommendations 
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Upon completing the steps defined in both the data and documenta- 
tion tracks, we carefully examined the processes of migrating hard- 
ware- and software-dependent formats to independent formats. We 
also evaluated the formats in relation to their utility and ease of use 
over time. In both the data migration and the documentation migra- 
tion processes, it was imperative that the original content be pre- 
served. Migration that included recoding the content of the data files 
(changing the character or numeric equivalents in a data file) proved 
to be labor-intensive and error-prone, and produced unacceptable 
changes to the original content of the data. Editing the text output 
from the scanning process proved to be the same: error-prone, time- 
consuming, and incomplete. Therefore, recoding as a part of migra- 
tion is not recommended. However, simply copying a file in its origi- 
nal format from medium to medium (refreshing) is not enough. 

Software-dependent data file formats, such as the original col- 
umn binary files examined in the project, cannot be read without 
specific software routines. If standard software packages do not offer 
those specific routines in the future, translation programs that emu- 
late the software's reading of the column binary format could pro- 
vide a solution. However, these emulation programs will themselves 
require migration strategies over time. We offer another alternative 
for the column binary format: convert the data out of the column bi- 
nary format into ASCII without changing the coded values of the 
files. The spread ASCII format meets the criterion of software inde- 
pendence while simultaneously preserving the original content of 
the data set. It does, however, require a file-by-file migration strategy 
that would be time-consuming for a large collection of files. 

Finding a parallel solution for the documentation files is not pos- 
sible at this time. We can not accurately generate character-by-char- 
acter equivalents of the paper records. We can, however, scan the pa- 
per into digital representations that could be used in future character 
recognition technologies. The Adobe PDF image+text format does 
provide an interim solution by producing digital versions of the im- 
age and limited ASCII representation of the text. However, the types 
of documentation files produced by the Adobe process are software 
dependent. If software packages move away from the format used to 
store the image+text files, translators will be necessary to search, 
print, and display the files. We therefore recommend archiving the 
images of the printed pages in both nonproprietary TIFF format and 
PDF image+text format. 



User Evaluation 

We asked faculty and graduate students to make an informal review 
of the findings and sample output from the project. All our evalua- 
tors had previous experience using data files from the Yale Roper 
Collection. Regarding the data conversions, they expressed relief at 
not having to use column binary input statements to read the data 
files. They had no difficulty in using the chart that mapped column 
binary to ASCII in order to locate variables in the spread ASCII ver- 
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sion of the data files, once it was explained that each variable 
mapped directly to a 12-column equivalent and instructions were 
given for finding single- and multiple-punch locations. 

As for the documentation track, faculty and student reviewers 
found viewing and browsing PDF format files acceptable. Since most 
users had accessed PDF files on the Web, they seemed comfortable 
moving from the Internet browser to the Adobe Reader to locate 
question text and variable location information. This may not be the 
case with inexperienced users. These users were eager to have more 
questionnaire texts available for browsing and searching. The lack of 
a large sample collection of questionnaires did not allow evaluation 
of question text searching on a large scale. 

Findings about Data Conversion 

Column binary into SAS and SPSS. As long as software packages can 
read the SAS and SPSS export formats, recoding the column binary 
format into SAS and SPSS export files is an attractive option. These 
file formats can be used easily, are transportable to multiple operat- 
ing systems and equipment configurations, and can be transformed 
into other software-specific formats. They do, however, have a num- 
ber of drawbacks. 

First, the original data file must be recoded, a process that is 
lengthy and potentially error-prone and one that places great reli- 
ance on the person doing the translation. If that person does not ade- 
quately check for errors, annotate the documentation for irregular 
variables, or properly recode the original patterns of punches, the 
translated data set becomes inconsistent with the original. Also, 
some irregularities in the original data set, which may be meaningful 
in the analysis of the data, can become lost when the data set is 
cleaned up. For instance, the original documentation might indicate 
that a question has four possible responses: PUNCH. 1 through 
PUNCH.3 for a rating scale, and PUNCH.! 2 for a "don't know" re- 
sponse. On examination of the xray, though, it is discovered that 
about 200 people had their responses coded as PUNCH.ll, not 
PUNCH. 12. A decision must be made: will those PUNCH.ll obser- 
vations be given the same special missing value code as that for 
PUNCH.! 2? This solution will put the responses in the data set into 
accordance with the documentation by assuming that the strange 
punches were simply due to errors in data entry. On the other hand, 
the strange punches could have been intentionally entered to mark 
out those observations for special reasons that are not listed in the 
documentation. In this case, it would be better to give the PUNCH.ll 
observations a special missing value different from that for 
PUNCH.! 2. Once the recoding is done, future researchers will be un- 
able to re-create the original data set with its irregularities. 

Second, this process of reading, recoding, and cleaning data files 
to produce SAS (or SPSS) system files and export files is very time- 
consuming. For example, it took 20 hours of work to write, debug, 
and double-check a SAS program to recode the data set for Roper 
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Report 9209, which includes a split sample and a variety of variable 
types. Assuming a wage of $15 an hour for an experienced SAS pro- 
grammer, we would expect a cost of approximately $600 per data set 
for a complete job of data recoding. This estimate, of course, does not 
include the cost of consultations with data archivists about the recod- 
ing of particular variables or the cost of rewriting documentation to 
reflect the new format of the data set. 

Third, data files stored in SAS and SPSS (and other statistical 
software) formats require proprietary software to read the informa- 
tion. Although there has been an increase in programs that can read 
and transfer data files from one program, and one version of a pro- 
gram, to another, there is no guarantee that programs for specific 
versions of software will be available in the future. U.S. Census data 
from the 1970s were produced in compressed format (called Dual- 
abs) that relied on custom programs and can no longer be read on 
most of today's computing platforms. Such system- and software- 
dependent formats require expensive migration strategies to move 
them to future computing technologies. 

Spread ASCII format. If an archival standard is defined as a non- 
column binary, nonproprietary format that faithfully reproduces the 
content of the original files, only the spread ASCII format meets 
these conditions. This spread format, however, is at least 600 percent 
larger than the original file and requires converting the original col- 
umn binary structure. It also requires producing additional docu- 
mentation, since each punch listed in the original documentation 
must be assigned a new column location in the spread ASCII data. 
We recommend that each file be converted to a standard spread 
ASCII format so that a single conversion map may be used for all the 
data files (see Appendix 5 for an example from this project). Produc- 
ing such an ASCII data set has several advantages. Information is not 
lost from the original data set, because the pattern of Os and Is re- 
mains conceptually the same across data files. It is unnecessary for a 
data translator to interpose herself between the original data and the 
final user. 

However, the spread ASCII format is not perfect. The storage re- 
quirements for spread ASCII data are on a par with the size require- 
ments for a SAS data set containing both recoded and intermediate 
variables. For the overhead in storage cost, the SAS data set at least 
provides internally referenced variable names and meaningful vari- 
able values. The size of the spread ASCII data becomes even more 
apparent when it is contrasted with the size of a recoded ASCII data 
set. In a recoded data set, each unused bit can be left out of the final 
data; particular bits in a column need not even be input if they con- 
tain no values for the variable, and columns without variables may 
also be skipped. In contrast, in the spread ASCII data, each bit is in- 
put and translated to a character, whether it is used or not. 

The spread ASCII format requires that users must know how the 
variables, particularly the multiple-response variables, relate to the 
12-column equivalent. A spread ASCII data set is not as easily used 
as a fully recoded one; after all, recoding the Os and Is into usable 
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variable values will fall to the end user with ASCII data, just as it 
currently does with column binary data. Finally, each punch listed in 
the original documentation must be assigned a new column location 
in the spread ASCII data. Users must refer to an additional piece of 
documentation, the ASCII data map, to locate data of interest, and 
this extra step inevitably creates some initial confusion. 

Hybrid spread ASCII. The hybrid spread ASCII format, distribut- 
ed by the Roper Center and the Institute for Research in Social Sci- 
ence at University of North Carolina, offers another alternative. The 
original data are stored in column binary format in the first horizon- 
tal layer of the file. This format preserves the original structure of the 
data file in the first part of a new record. A second horizontal layer of 
converted data, in spread ASCII format, is added to the new record. 
The primary advantage of this hybrid spread ASCII data file format 
is that users can access the nonbinary portions of the original data 
file in their original column locations as indicated on the question- 
naires. However, users have to know whether the question and vari- 
ables of interest were coded in the binary format in order to deter- 
mine whether to read the first part of the record or the second. The 
ASCII codes in the first horizontal layer are readable, but any binary 
coding in that first horizontal layer are not, since the binary coding is 
converted to ASCII to avoid problems while reading the data with 
statistical software. If users want to read data that were coded in bi- 
nary in the original file, they can read the spread ASCII version of 
multiple-punched equivalents in the second horizontal layer without 
having to learn the column binary input statements to read the data. 

We chose to produce converted ASCII files that did not have this 
two-part structure. To use the spread ASCII data files we construct- 
ed, users first determine the original column location of a particular 
variable from the questionnaire, and then use a simple data map to 
locate the column location in the new spread ASCII file (see Appen- 
dix 5). 

Original column binari/ format. The column binary format itself 
turned out to be more attractive as a long-term archival standard 
than we had anticipated. It conserves space, it preserves the original 
coding of data and matches the column location information in the 
documentation, it can be transferred among computers in standard 
binary, and it can be read by standard statistical packages on all of 
the platforms we used in testing. Unfortunately, it is difficult to lo- 
cate and decipher information about how to read column binary data 
with SAS and SPSS, as the latest manuals (for PC versions of the soft- 
ware) no longer contain supporting information about this format. 
This lack of documentation support indicates the possibility that the 
input formats will not be offered in subsequent software versions. 

On the other hand, as long as the format exists, there seems to be 
some level of commitment to support it. As stated in the SAS Lan- 
guage: Reference text, 'T)ecause multipunched decks and card-image 
data sets remain in existence... the SAS System provides informats 
for reading column-binary data." (SAS 1990, 38-9). If the column bi- 
nary format is refreshed onto new media and preserved only in its 
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original form, we recommend that sample programs for reading the 
data with standard statistical packages, or a stand-alone translation 
program, be included in the collection of supporting files. But even 
with these supplemental programs, accessing and converting the 
files will continue to present significant challenges to researchers. 
Given these considerations, we do not recommend this format over 
the spread ASCII alternative. 

Findings about Documentation Conversion 

The OCR output files from TextBridge Pro did not provide us with 
an adequate means for archiving the textual material. The question- 
naires we scanned had an unacceptable rate of character recognition, 
including incorrect location information necessary for manipulating 
the accompanying data files. Handwritten notes were completely 
lost. Although the format allowed for searching of a particular text 
that was successfully recognized, the amount of editing required to 
produce a legible version of the original, review the output, and cor- 
rect all errors was found to be prohibitive. Subsequent viewing was 
poor without the formatting capabilities of proprietary word pro- 
cessing software. 

The PDF format provided solutions to some of the documenta- 
tion distribution and preservation problems we faced, but it did not 
meet all of our needs. For one thing, the format does not go far 
enough in providing internal structure for the manipulation, output, 
and analysis of the metadata. Like a tagged MARC record in an on- 
line public access catalog, full-text documentation for numeric data 
requires specific content tagging to allow search, retrieval, manipula- 
tion, and reformatting of individual sections of the information. An- ' 
other drawback is that PDF files are produced and stored in a format 
that may be difficult to read and search in the future. The PDF for- 
mat, although a published standard, depends on proprietary soft- 
ware that may not be available in future computing environments. 

(A similar problem can be seen with dBase data files that are rapidly 
becoming outmoded, causing major problems with large collections 
of CD-ROM products distributed by the U.S. government.) We see 
increasing numbers of PDF documents distributed on the Internet 
and the format will be used by ICPSR for the distribution of ma- 
chine-readable documentation to its member institutions. So, given 
the large number of PDF files in distribution, software for conversion 
will most likely be developed over time. However, the current popu- 
larity of the PDF format does not guarantee that software to read it 
will continue to be available throughout the future of technological 
evolution. 

The TIFF graphic image file format is useful for viewing and for 
distribution on the Web and as an intermediate archival format, al- 
lowing storage of files until they can be processed in the future using 
more advanced OCR technology. Even though this format does not 
allow text searching, tagging/mark up, or editing, it moves the en- 
dangered material into digital format. ICPSR has decided to retain 
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such digital images of its printed documentation collection for repro- 
cessing as OCR technology evolves. 

It is our recommendation that both PDF/Adobe Capture edited 
output and scanned image files be produced and archived. The 
PDF/ Adobe images text files allow searching and viewing of text in 
original formatting. The scanned image files can be archived for fu- 
ture character recognition and enhancement. We also want to em- 
phasize the importance of the long-term development of tagged doc- 
umentation using the DDI format. This is by far the most desirable 
format, albeit one that is difficult to produce from printed documen- 
tation, given the inadequacy of OCR technology and the costs of sub- 
sequent editing. We urge future producers and distributors of nu- 
meric data to help develop and adhere to standard system- 
independent documentation. 



Recommendations to Data Producers 

Design affects maintenance costs and long-term preservation. Producers of 
statistical data files need to be cognizant of preservation strategies 
and the importance of system-independent formats that can be mi- 
grated through generations of media and technological applications. 
In this project, we had significant problems with the column binary 
format of the data. Had long-term maintenance plans been consid- 
ered, and the costs of migration been taken into consideration, the 
creators of the data format might not have chosen the column binary 
format. At the time, however, it was the most compact format to use 
and standard software could then be adapted to the format. 

One thing is very clear: data producers would be advised and 
should be persuaded to take long-term maintenance and preserva- 
tion considerations into account as they create data files and as they 
design value-added systems. Our experience shows that the most 
simple format is the best long-term format: the flat ASCII data file. 

We would urge producers to provide column-delimited ASCII files, 
accompanied by complete machine-readable documentation in 
nonproprietary and nonplatform-specific formats. Programming 
statements (also known as control cards) for SAS and SPSS are also 
highly recommended so that users can convert the raw data into sys- 
tem files. The control card files can be modified for later versions of 
statistical software and used for other programming applications 
and indexing. 

In accordance with emerging standards for resource discovery, 
data files should contain a standard electronic header or be accompa- 
nied by machine-readable metadata identification information. This 
information should include complete citations to all the parts of a 
particular study (data files, documentation, control card files, and so 
forth) and serve as a study-level record of contents and structure. 

Metadata standards. Not only must standards be considered for 
the structure of the numeric data files, but the metadata — informa- 
tion describing the data — must also conform to content and format 
standards for current use and for long-term preservation applica- 
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tions. Content standards require common elements, or a common set 
of "containers" that hold specific types of information. Standards for 
coding variables should be followed so that linkages and cross-study 
analysis are enhanced, and common thesauri for searching and min- 
ing data collections also need to be produced. As for format stan- 
dards, metadata should be produced in system-independent formats 
that provide standard structures for the common elements and cod- 
ing schemes. These format standards should provide consistent tag- 
ging of elements that can be mapped to resource discovery and 
viewing software and to statistical analysis and database systems. 

For most social science data files, machine-readable documenta- 
tion should be supplied in ASCII format that conforms to standard 
guidelines for both content and format. With some very large sur- 
veys using complex survey instruments, files in this ASCII format 
may be so big that complex structures in nonproprietary format need 
to be developed to reduce storage requirements. If paper documenta- 
tion is distributed, it should be produced using high-quality duplica- 
tion techniques with simple fonts, no underlining, no handwritten 
notations, and plenty of white space. 

Of particular interest is the Data Documentation Initiative (DDI) 
DTD (Document Type Definition in XML), which is a developing 
standard for documentation describing numeric databases. It pro- 
vides both content and format standards for creating digitized docu- 
mentation in XML. Data producers in the United States, Canada, and 
Europe will be testing the DTD as part of their documentation pro- 
duction process. Data archives will be converting their digitized doc- 
umentation into this format, and some will be scanning paper docu- 
mentation and tagging the content with the DDI. 

Complex statistical systems. We must also be concerned about the 
long-term preservation plans for complex systems. As we see more 
efforts to integrate data and documentation within linked systems, 
we see a growing tension between access and preservation. Complex 
database management systems such as ORACLE or SQL present us 
with more complex questions: What parts of the system need to be 
preserved to save the content of the information as well as its integri- 
ty? Will snapshots of the system provide future users with enough 
information to simulate access as we see it today? Is the content us- 
able outside the context of the system? These are the database preser- 
vation challenges of the future. 
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ASCII — American Standard Code for Information Interchange. A 
character encoding scheme used by many computers. The ASCII 
standard uses 7 of the 8 bits that make up a byte to define the codes 
for 128 characters. Example: in ASCII, the number seven is a treated 
as a character and is encoded as: 00010111. Because a byte can have a 
total of 256 possible values, there are an additional 128 possible char- 
acters that can be encoded into a byte, but there is no formal ASCII 
standard for those additional 128 characters. Most IBM-compatible 
personal computers do use an IBM extended character set that in- 
cludes international characters, line and box drawing characters, 
Greek letters, and mathematical symbols. 

C — A programming language. 

CBw;.rf Instructions that the SAS System uses to read standard nu- 
meric values from column-binary files, translating the data into stan- 
dard binary format. The w value specifies the width of the variable, 
usually 8, but has a range between 1 and 32. The d value specifies the 
number of digits to the right of the decimal point in the numeric value. 

card — Also known as deck, a physical record of data. A survey may 
have multiple cards for each respondent, all cards together compris- 
ing a logical record. Based on the IBM punch cards of 80-column 
length. 

case— The unit of analysis in a particular data file. Can be an individ- 
ual respondent to a questionnaire, a customer, or an industry. In the 
Roper Reports, each case is an interview respondent. 

codebook — Description of the organization and content of a data 
file. Contains the code ranges and the code meanings needed to in- 
terpret the data file. 

column binary — A code originally used with punched cards in 
which successive bits are represented by the presence or absence of 
punches in contiguous positions in columns. Using this method, re- 
sponses to more than one question can be stored in a single column. 

data dictionary — A file, part of a file, or part of a printed codebook 
containing information about a data file, including the name of the 
element, its format, location, and size. 

Data Documentation Initiative (DDI) — An international committee 
sponsored by ICPSR that is developing a new metadata standard for 
social science documentation. This standard, developed by represen- 
tatives from the international social science research community, is 
intended to fill the need for a structured codebook standard that will 
serve as an interchange format and permit the development of new 
Web applications. The Document Type Definition (DTD) for the DDI 
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standard is written in XML (Extensible Markup Language) and is 
available at http://www.icpsr.umich.edu/DDI/. 

deck — Also known as card, a logical record of data. A survey may 
have multiple cards for each respondent, all cards together compris- 
ing a logical record. Based on the IBM punch cards of 80-column 
length. 

documentation — Information that accompanies a data file, describ- 
ing the condition of the data, the creation of the file, the location and 
size of variables in the file, and the values (or codes) of the variables. 

export file — A file produced by a software package that is designed 
to be read on another computer, often with a different operating sys- 
tem, running a version of the same software package. 

HTML — HyperText Markup Language 

ICPSR — Inter-university Consortium for Political and Social Research 

infonnat — The instructions that specify how SAS reads the numbers 
and characters in a data file. 

intermediate variable — A variable used when recoding data to input 
information from individual punches in multipunch data. Sets of in- 
termediate variables are then recoded to produce final variables. 

logical record — A complete unit of data for a particular unit of anal- 
ysis, in this project a single respondent. Multiple physical records, 
called cards or decks, may make up a logical record. 

missing value — A value code that indicates no data are present for a 
variable for a particular case. To be distinguished from non-response 
values (respondent refused to answer or was not asked the question) 
and from invalid responses (the response did not have a valid value 
code equivalent). Non-responses and invalid responses may or may 
not have value categories provided in the questionnaire and may be 
treated differently from true missing data during analysis. 

multipunched — A way of recording data, originally used with 
punched cards, in which successive bits are represented by the pres- 
ence or absence of punches in contiguous positions in columns. Us- 
ing this method, responses to more than one question can be stored 
in a single column. 

OCR — Optical character recognition 

PDF — Portable Document Format, a published standard format de- 
veloped by Adobe Systems, accessed with proprietary software. 
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punch card — A paper medium used for recording computer-read- 
able data. The card is punched by a special machine called a key- 
punch that works like a typewriter, except that it punches holes in 
cards instead of typing characters on paper. The punch cards are 
then processed with a card reader that transfers the punched infor- 
mation to a computer-readable digital format. 

PUNCH.d — Instructions that the SAS system uses to read standard 
numeric values from column binary files. The d value specifies which 
row in a card column to read. Valid values for the d value are 1 
through 12. 

questionnaire — The set of questions asked in a survey. In the Yale 
Roper Collection, the questionnaire, with columns and codes written 
next to the question, may substitute for a codebook. 

recode — Changing the value code of a variable from one value to an- 
other. For example, changing 0 and 1 values in column binary data 
files to value ranges of 0 through 12. Also known as data transforma- 
tion. 

respondent — In survey research, the person responding to the sur- 
vey questions, 

KOWiv.d — Instructions that the SAS system uses to read a column- 
binary field down a card column. The w value specifies the row 
where the field begins, with a range between 1 and 12. The d value 
specifies the length in rows of the field. Valid values for d are 1 
through 25, with the default value of 1. The informat assigns the rela- 
tive position of the punch in the field range to a numeric variable. 

SAS — Set of proprietary computer programs used for analysis of so- 
cial science statistical data. (No longer an acronym; originally stood 
for Statistical Analysis System.) 

SGML — Standard General Mark-Up Language 

SPSS — Statistical Package for the Social Sciences. Set of proprietary 
computer programs used for analysis of social science statistical 
data. 



single-punch — A single response coded in a column. 

split sample — A method of data collection in which one group of 
respondents is queried with one form of a questionnaire and the sec- 
ond group is queried with a different form of the questionnaire. 

spread — Recoding multiple responses that have been coded in a sin- 
gle column of a record to a separate column for each response. 
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system file — A data file or collection of data files specifically format- 
ted for a particular software package; may not be readable by other 
software packages. 

TIFF — ^Tagged Image File Format 

values — The numeric or character equivalents for a particular vari- 
able in a data file. 

variable — An item in a data file to which a value has been assigned. 
A data file contains the values of certain variables measured for a set 
of cases. In the Roper Report data files, variables are responses to 
questions or parts of questions from each person interviewed. 

XML — Extensible Markup Language (XML) is a data format for 
structured document interchange on the Web. 

xray — A form of output that is organized by card, column, and row; 
each bit has its own unique location within this framework. The total 
number of punched bits across all observations is recorded for each 
location in the data set. This sum often provides a response frequen- 
cy for individual response options. 
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Appendix 1: Roper Report documentation page 3W: Questions 7-9 
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7. And thinking about crime in the United States, what one type of crime do you 

feel presents the biggest threat for you and your family today? (HAND RESPON- 



DENT CARD) 

a. Burglary, robbery, auto theft 1 45/ 

' b. Vandalism/hooliganism 2 

c. Official corruption, government bribe-taking 3 

d. Murder 4 

e. Rape 5 

f . Assault 6 

g. Racketeering, extortion 7 

h. Drugs 8 

i. Prostitution 9 

j. Speculation and swindling 0 

k. Other 1 46/ 

None of the above (vol . ) 2 

Don * t know Y 



8. Of course, the problems we face in 
our society often have many differ- 
ent causes. I am going to ask you 
about some different problems in 
society and for each one I want you 
to tell me what you think are the 
major causes of each problem. 

First, violent crime. (HAND RESPON- 
DENT CARD) Using this card, which 
three or four things do you think 
are the main causes of people com- 
mitting violent crimes? 

VIOLENT 

CRIME 



a. 

b. 

c. 

d. 



a. 

b. 

c. 

d. 

e. 

f. 

g- 

h. 

i. 

j- 

k. 

l . 

m. 

n. 

o. 
P- 







Living in poverty 1 

Parents not teaching 

right from wrong 2 

Being abused as a child 3 

Drug abuse 4 

What people see in TV programs 5 

A lack of morals 6 

A lack of education 7 

A person not seeing any harm in it . . 8 

A person being irresponsible 9 

Influence of friends 0 

Alcohol abuse X 

What people see in movies Y 

The advertising and marketing 

of toy guns , etc 1 

Guns being too easy to get 2 

Low chance of being punished 3 

Seeing pornography 4 

None of these 5 

Don ’ t know Y 



47/ 



48/ 



f 



e. 

f. 

g. 

h. 

i. 

j- 

k. 



1 . 



P- 



What about drunk driving? (HAND 
RESPONDENT' CARD) Using this card, 
which three or four things do you 
think are the main causes of people 
becoming drunk drivers? 

DRUNK 

DRIVING 



Living in poverty 1 49/ 

Parents not teaching 

right from wrong 2 

Being abused as a child - 3 

Drug abuse 4 

What people see in TV programs • 5 

A lack of morals 6 

A lack of education ; 7 

A person not seeing any harm in it . . 8 

A person being irresponsible 9 

Influence of friends 0 



Being unable to control drinking 



because alcoholism is a 

disease X 

What people see in movies Y 

What people see in advert ising 1 50/ 

Bartenders not being responsible .... 2 

Low chance of being punished 3 

Friends and family not looking 

out for each other 4 

None of these 5 

Don ’ t know Y 

47 



Preserving the Whole 



39 



Appendix 2: Sample SAS input and recode statements 

I* 

This snippet of a SAS program consists of two parts: the first part is an example of reading in column bi- 
nary data, the second is an example of inputting the same data in spread ASCII format 

V 

DATA ONE; 

[NFtLE IN LRECL=160 RECFM=F; 

INPUT 

#1@26Q10_1 PUNCH.l 
@26Q10_2 PUNCH.2 
@26Q10_Y PUNCH.l 2 
#9 @5 Q101_l PUNCH.l 
@5 Q101_2 PUNCH.2 
@5 Q101_3 PUNCH.3 
@5 Q101_XPUNCH.11 
@5 Q101_YPUNCH.ll; 

IF Q10_l=l AND Q10_2=0 AND Q10_Y=0 THEN Q10=l; 

IFQ10_1=0 AND Q10_2=1 AND Q10_Y=0 THEN Q1 0=2; 

IF Q10_l=0 AND Q10_2=0 AND Q10_Y=1 THEN Q10=99; 

IF Q101_l=l ANDQ101_2=0ANDQ101_3=0ANDQ101_X=0ANDQ101_Y=0 
THEN Q101=l; 

IF Q101_l=0 AND Q101_2=l AND Q101_3=0 AND Q101_X=0 AND Q101_Y=0 
THEN Q101=2; 

IF Q101_l=0 AND Q101_2=0 AND Q101_3=l AND Q101_X=0 AND Q101_Y=0 
THEN Q1 01 =3; 

IF Q101_l=0 AND Q101_2=0 AND Q101_3=0 AND Q101_X=1 AND Q101_Y=0 
THEN Q101=4; 

IF Q101_l=0 AND Q101_2=0 AND Q101_3=0 AND Q101_X=0 AND Q101_Y=1 
THEN Q101=99; 

DATA ONE; 

INFILE IN LRECL=960 RECFM=F; 

INPUT 

#1 @301 Q10_l 
@302 Q10_2 
@312Q10_Y 
#9 @49 Q101_l 
@49 Q101_2 
@49 Q101_3 
@49 Q101_X 
@49 Q101_Y; 

IF Q10_l=l AND Q10_2=0 AND Q10_Y=0 THEN Q10=l; 

IF Q10_l=0 AND Q10_2=l AND Q10_Y=0 THEN Q10=2; 

IF Q10_l=0 AND Q10_2=0 AND Q10_Y=1 THEN Q10=99; 

IF Q101_l=l AND Q101_2=0 AND Q101_3=0 AND Q101_X=0 AND Q101_Y=0 
THENQ101=1; 

IFQ101_1=0 AND Q101_2=l AND Q101_3=0 AND Q101_X=0 AND Q101_Y=0 
THEN Q101=2; 

IF Q101_l=0 AND Q101_2=0 AND Q101_3=l AND Q101_X=0 AND Q101_Y=0 
THENQ101=3; 

IF Q101_l=0 AND Q101_2=0 AND Q101_3=0 AND Q101_X=1 AND Q101_Y=0 
THEN Q101=4; 

IF Q101_l=0 AND Q101_2=0 AND Q101_3=0 AND Q101_X=0 AND Q101_Y=1 
THEN Q101=99; 



BEST COPY AVAILABLE 



40 



Ann Green, JoAnn Dionne, and Martin Dennis 



Appendix 3: Data conversion formats and storage requirements 



Platform 


Data set Description 
(Sept. 1993 Roper Report) 


Storage 
Requirements 
(in bytes) 


Percentage of 
Original 


All 

platforms 


Original (column-binary) 


2,858,400 




IBM PC 
data sets 


SAS data set (full set of variables) 


19,464,984 


681 




SAS XPORT data set (full set of variables) 


19,144,720 


670 




SAS data set with 4 byte integers (full set of 
variables) 


9,978,112 


349 




SAS data set with 3 byte integers (full set of 
variables) 


7,528,192 


263 




SAS data set (partial set of variables) 


8,879,360 


311 




SAS XPORT data set (partial set of variables) 


8,843,760 


309 




SAS data set with 4 byte integers (partial set of 
variables) 


4,571,392 


160 




SAS data set with 3 byte integers (partial set of 
variables) 


3,471,616 


121 


Mainframe 
data sets 


SAS data set (full set of variables) 


23,347,200 


817 




SAS XPORT data set (full set of variables) 


19,144,720 


670 




SAS data set (partial set of variables) 


11,059,200 


387 




SAS XPORT data set (partial set of variables) 


8,843,760 


309 
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Appendix 4: Programs to create spread ASCII datasets 

Part one: The SAS Macro 

%INCLUDE 'j:\statlab\roper\data\spread.mac'; 

* ALWAYS INDICATE FULL PATH & DATASET NAME OF 

* COLUMN-BINARY DATA (DSN=) AND THE FULL PATH 

* & DATASET NAME OF THE SPREAD DATA (OUT=) ^ 

* IN THE MACRO CALL 

%SPREAD(DSN=h:\ssda\roper\re ports\rprr9309\rprr9309.bin, 
OUT=h:\ssda\roper\reports\rprr9309\rprr9309.spr); 

RUN; 



Part two: Read in column binary data, convert to ASCII; create data map showing new column locations 

%MACRO SPREAD(DSN=,OUT=:); 

OPTION ERRORS = 0; /* CHANGES INVALID DATA MESSAGES TO WARNINGS */ 

OPTIONS NONUMBER NODATE; 
options ps=58; 

DATA_NULL_; 

INFILE "&DSN" Ired=l60 recfm=:f; 

FILE "&OUT" lred=960 ; 
length a $1; 

DO i=l TO 80; 

INPUT @i A $CB1 . @i rowl punch. 1 
@i row2 punch. 2 
@i row3 pundi.3 
@i row4 punch.4 
@i row5 punch.5 
@i row6 punch. 6 
@i ro w7 punch. 7 
@i row8 punch. 8 
@i row9 pundi.9 
@i rowlO punch.O 
@i rowll punch.ll 
@i rowl 2 punch.l2 @; 

PUT @(1 +(12*0-1))) (rowl -rowl 2) (1.) @; 

END; 

put; 

options ls=80; 
data _null_; 
file print; 

title DATA MAP FOR COLUMN-BINARY SPREAD DATA; 
titles NOTE: Rows 1-9,0,X,Y correspond to columns 1-12.; 
put; put; put; 
do i = 1 to 40; 
coll = 1 + (12*0-1)); 
col2 = 12 + (12*(i-D); 
col3 = 1 + (12*(40+i-l)); 
col4 = 12 + (12*(40+i-D); 
cola = i; 
colb = 40 + i; 

put @1 "Column " cola 2. " maps to " coll 4. " through " col2 4. @; 
put @41 "Column " colb 2. " maps to " col3 4. " through " col4 4. @; 
put; 
end; 

run; 
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%MEND; 
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Appendix 5: Data map for column binary spread data 



NOTE: Rows 1-9,0,X,Y correspond to columns 1-12. 



Column 1 maps to 1 through 12 
Column 2 maps to 13 through 24 
Column 3 maps to 25 through 36 
Column 4 maps to 37 through 48 
Column 5 maps to 49 through 60 
Column 6 maps to 61 through 72 
Column 7 maps to 73 through 84 
Column 8 maps to 85 through 96 
Column 9 maps to 97 through 108 
Column 10 maps to 109 through 120 
Column 11 maps to 121 through 132 
Column 12 maps to 133 through 144 
Column 13 maps to 145 through 156 
Column 14 maps to 157 through 168 
Column 15 maps to 169 through 180 
Column 16 maps to 181 through 192 
Column 17 maps to 193 through 204 
Column 18 maps to 205 through 216 
Column 19 maps to 217 through 228 
Column 20 maps to 229 through 240 
Column 21 maps to 241 through 252 
Column 22 maps to 253 through 264 
Column 23 maps to 265 through 276 
Column 24 maps to 277 through 288 
Column 25 maps to 289 through 300 
Column 26 maps to 301 through 312 
Column 27 maps to 313 through 324 
Column 28 maps to 325 through 336 
Column 29 maps to 337 through 348 
Column 30 maps to 349 through 360 
Column 31 maps to 361 through 372 
Column 32 maps to 373 through 384 
Column 33 maps to 385 through 396 
Column 34 maps to 397 through 408 
Column 35 maps to 409 through 420 
Column 36 maps to 421 through 432 
Column 37 maps to 433 through 444 
Column 38 maps to 445 through 456 
Column 39 maps to 457 through 468 
Column 40 maps to 469 through 480 



Column 41 maps to 481 through 492 
Column 42 maps to 493 through 504 
Column 43 maps to 505 through 516 
Column 44 maps to 517 through 528 
Column 45 maps to 529 through 540 
Column 46 maps to 541 through 552 
Column 47 maps to 553 through 564 
Column 48 maps to 565 through 576 
Column 49 maps to 577 through 588 
Column 50 maps to 589 through 600 
Column 51 maps to 601 through 612 
Column 52 maps to 613 through 624 
Column 53 maps to 625 through 636 
Column 54 maps to 637 through 648 
Column 55 maps to 649 through 660 
Column 56 maps to 661 through 672 
Column 57 maps to 673 through 684 
Column 58 maps to 685 through 696 
Column 59 maps to 697 through 708 
Column 60 maps to 709 through 720 
Column 61 maps to 721 through 732 
Column 62 maps to 733 through 744 
Column 63 maps to 745 through 756 
Column 64 maps to 757 through 768 
Column 65 maps to 769 through 780 
Column 66 maps to 781 through 792 
Column 67 maps to 793 through 804 
Column 68 maps to 805 through 816 
Column 69 maps to 817 through 828 
Column 70 maps to 829 through 840 
Column 71 maps to 841 through 852 
Column 72 maps to 853 through 864 
Column 73 maps to 865 through 876 
Column 74 maps to 877 through 888 
Column 75 maps to 889 through 900 
Column 76 maps to 901 through 912 
Column 77 maps to 913 through 924 
Column 78 maps to 925 through 936 
Column 79 maps to 937 through 948 
Column 80 maps to 949 through 960 
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Appendix ©: Roper Report documentation page 4 W/Y: Question 10 

photocopy 

PAGE 4 W/Y - CD-I 

10. People have strong feelings a±>out some issues, and not so strong about others. 
On this caxd (HAND RESPONDENT CARD) there is a scale of feelings~-absolutely 
delighted, pleased, somewhat satisfied, no real feelings one way or the other, 
somewhat dissatisfied, angry, and boiling mad. Using this scale, how would you 
describe your feelings when you think about: (ASK ABOUT EACH) 







-L. 

ABSX>- 

LUTELY 

DE- 

LIGHTED 


■2^ 

PLEASED 


-Lu 

SOME- 

WHAT 

SATIS- 

FIED 


4^ 

/JD 

REAL 

FEELr- 

WGS 


SOME- 

WHAT 

DIS- 

SATIS- 

FIED 


ANO^Y 


2s. 

BOILUJG 

MAD 


DON'T 

KNOW 


a. 


The way things are going 
for you personally these 
days 


. . . 1 


2 


3 


4 


5 


6 


1 


Y 


51/ 


b. 


The way things are going 
for the country 
generally 


. . . 1 


2 


3 


4 


5 


6 


1 


Y 


52/ 


c. 


How President Clinton 
is dealing with the 
ecx>norry 


- . 1 


2 


3 


4 


5 


6 


1 


Y 


53/ 


d. 


How Congress is dealing 
with the econony 


. . 1 


2 


3 


4 


5 


6 


■ 1 


Y 


54/ 


e. 


How your state government 
is dealing with the 
state *s economy 


. . 1 


2 


3 


4 


5 


6 


1 


Y 


55/ 


f. 


How secure your job is .. 


. . 1 


2 


3 


4 


5 


6 


1 


Y 


56/ 


g- 


Your health care costs .. 
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